The ultimatum game

In this notebook several simulations of the ultimatum game are run using the player_1, player_2 and ultimatum classes defined in the other modules of this repository.

One-shot repeated game

First, one simulation is run to understand how each component works.
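
As a reference for how such a run could be wired up, the sketch below mimics the proposer side with a minimal epsilon-greedy Q-learner. The class and method names (ProposerSketch, play_episode, the endowment of 10 units) are illustrative assumptions, not the actual API of the player_1, player_2 and ultimatum classes:

```python
# Hypothetical sketch: the real player_1, player_2 and ultimatum classes
# live in the other modules of this repository; names and signatures
# here are illustrative assumptions only.
import random

ENDOWMENT = 10  # units to split in each episode (assumed value)

class ProposerSketch:
    """Epsilon-greedy Q-learning proposer (stand-in for player_1)."""
    def __init__(self, epsilon=0.1, alpha=0.5):
        self.epsilon = epsilon
        self.alpha = alpha
        self.q = [0.0] * (ENDOWMENT + 1)  # one value per possible offer

    def move(self):
        if random.random() < self.epsilon:
            return random.randint(0, ENDOWMENT)                       # explore
        return max(range(ENDOWMENT + 1), key=self.q.__getitem__)      # exploit

    def update(self, offer, reward):
        # Simple incremental update towards the observed reward.
        self.q[offer] += self.alpha * (reward - self.q[offer])

def play_episode(proposer):
    offer = proposer.move()
    accepted = 1  # a responder that accepts everything, for illustration
    reward = ENDOWMENT - offer if accepted else 0
    proposer.update(offer, reward)
    return offer, reward

random.seed(0)
agent_a = ProposerSketch()
history = [play_episode(agent_a) for _ in range(10_000)]
```

Against an always-accepting responder, the greedy offer converges to the lower bound of the endowment, which is the pattern described below.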

Partial results of the 10,000 episodes played are shown:

Results for agent A.

The plot shows the offers made by agent A in each episode of the simulation: every dot is an offer, and the lines are formed by consecutive offers of the same level. The colors represent the action taken by agent B in each episode. The learning process of agent A can be seen: the agent explores the options and realizes that decreasing the offer yields higher rewards.

The distribution of agent A's actions shows that the majority are concentrated towards the lower bound of the endowment.

The rewards plot for agent A shows the growth caused by the learning process.

Results for agent B.

The plot shows the average of all actions taken by agent B in each episode. It can be observed that the agent quickly learns that accepting every offer yields a higher reward than rejecting, so the action tends to be "accept" the offer, that is, action 1.
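
Agent B's side of this learning process can be sketched in the same spirit. The ResponderSketch class below is a hypothetical stand-in for player_2, not the repository's implementation; it shows why accepting (action 1) quickly dominates rejecting (action 0):

```python
# Hypothetical stand-in for player_2: a two-action epsilon-greedy learner.
import random

class ResponderSketch:
    """Q-learner over two actions: 0 = reject, 1 = accept."""
    def __init__(self, epsilon=0.1, alpha=0.5):
        self.epsilon, self.alpha = epsilon, alpha
        self.q = [0.0, 0.0]

    def move(self, rng):
        if rng.random() < self.epsilon:
            return rng.randrange(2)                       # explore
        return max(range(2), key=self.q.__getitem__)      # exploit

    def update(self, action, reward):
        self.q[action] += self.alpha * (reward - self.q[action])

rng = random.Random(0)
agent_b = ResponderSketch()
for _ in range(1_000):
    offer = 3  # any positive offer: accepting pays the offer, rejecting pays 0
    action = agent_b.move(rng)
    agent_b.update(action, offer if action == 1 else 0)
# After training, the Q-value of "accept" is well above that of "reject".
```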

The next histogram shows that most of the actions taken by agent B in the simulation have value 1.

However, as a consequence of this learning process, the rewards for agent B decrease over time, since agent A learns that agent B will accept every offer.

Exploring parameters for the agents

In this part of the notebook, several simulations are run with different values of epsilon and alpha to identify the optimal value for each agent.

epsilon

Epsilon will take the representative values 0.01, 0.1, 0.5 and 1 for each player, and the average across simulations for each episode is studied.
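
A sweep of this kind could look like the toy sketch below, where run_simulation is a simplified stand-in for the notebook's actual driver (a two-action learner instead of the full ultimatum game), used only to illustrate averaging across simulations for each episode:

```python
# Hypothetical sweep: run_simulation is a toy stand-in for the notebook's
# driver; action 1 pays 1, action 0 pays 0.
import random

EPSILONS = [0.01, 0.1, 0.5, 1.0]
N_SIMS = 5
EPISODES = 200

def run_simulation(epsilon, episodes, rng):
    """Return the per-episode rewards of one epsilon-greedy run."""
    q = [0.0, 0.0]
    rewards = []
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.randrange(2)                     # explore
        else:
            action = max(range(2), key=q.__getitem__)     # exploit
        reward = float(action)
        q[action] += 0.5 * (reward - q[action])
        rewards.append(reward)
    return rewards

averages = {}
for eps in EPSILONS:
    rng = random.Random(42)
    runs = [run_simulation(eps, EPISODES, rng) for _ in range(N_SIMS)]
    # Average across simulations for each episode, as studied below.
    averages[eps] = [sum(col) / N_SIMS for col in zip(*runs)]
```

Even in this toy setting, a moderate epsilon (0.1) outperforms pure randomness (epsilon = 1) in late episodes, which mirrors the pattern discussed for agent A.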

Results for agent A.

The partial results of the average action for each episode and each value of epsilon are shown in the table.

In the plot it is clear that with an epsilon that is too high or too low, the agent does not learn quickly how to minimize the offers to maximize the rewards. The fastest and most consistent learning pattern is found when epsilon takes the value 0.1.

The partial results of the average reward for each episode and each value of epsilon are shown in the table.

Given the faster learning process when epsilon is set to 0.1, the rewards are also maximized in this case.

The average total payoff for different values of epsilon confirms that for agent A the optimal value of epsilon is 0.1.

Results for agent B.

In the table, the average action taken by agent B is shown for different episodes.

In the plot it can be observed that with an epsilon of 0.1 the agent learns faster to accept every offer made by agent A.

The table shows the average reward for some episodes according to different values of epsilon. It can be seen how the rewards increase with epsilon.

The plot makes it clearer how the rewards for agent B decrease quickly when the epsilon parameter allows it to learn faster to accept all offers. Indeed, with this behavior, agent A learns faster to minimize the offers, and as a result agent B's rewards are driven down.

The average total payoff for every value of epsilon confirms that for agent B the optimal value of epsilon is 1. In this way, the agent randomly chooses an action in every episode, making it difficult for agent A to learn to minimize the offers, and therefore maximizing agent B's rewards.
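
The effect of epsilon = 1 can be illustrated directly: with full exploration, the chosen action is independent of the Q-values. The helper below is a generic epsilon-greedy selector written for this illustration, not the repository's code:

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

rng = random.Random(1)
q = [0.0, 5.0]  # a Q-table that strongly prefers action 1 ("accept")
picks = [epsilon_greedy(q, epsilon=1.0, rng=rng) for _ in range(10_000)]
share_accept = sum(picks) / len(picks)
# With epsilon = 1 the Q-table is ignored: roughly half the picks are 0,
# half are 1, no matter how lopsided the Q-values are.
```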

alpha

Alpha will take the representative values 0.01, 0.1, 0.5 and 1 for each player, and the average across simulations for each episode is studied.

Results for agent A.

The partial results of the average action for each episode and each value of alpha are shown in the table.

The plot below shows the path of the offers for different alpha values. It can be seen that, on average, the offers grow quickly in the first thousand episodes and later slowly decrease towards the theoretical solution of the game when alpha is 0.1 or 0.5. On the other hand, when alpha is either too small (0.01) or too large (1), the offers grow during the first thousands of episodes and then stay rather constant.

As for the rewards for agent A, they grow over time when alpha is set to 0.1 or 0.5 and stay high for the rest of the simulation. In contrast, when alpha is too low the rewards grow very slowly over the course of the simulation.

The average total payoff across simulations shows that 0.5 is the optimal value of alpha for agent A.

Results for agent B.

In the table, the average action taken by agent B is shown for different episodes.

The plot shows, for each episode, the evolution across simulations of the action taken by agent B. It can be seen that the path of actions grows faster when the value of alpha is .

The rewards increase sharply and then decrease during the simulations when the alpha value is 0.1 or 0.5, while they grow slowly when alpha is either too small (0.01) or too large (1).

From the average of total payoffs for each value of alpha it can be concluded that the best value of alpha for agent B is 1.

New game with best parameters for each player

For this new set of simulations, the parameters will differ between the agents, each using its optimal values. A total of 50 simulations will be run to analyse the results. Keep in mind that the value of alpha for player B ends up being irrelevant: since epsilon equals 1, all of agent B's actions are random and do not depend on the values of its Q-table.

Results for agent A.

The table below shows the path of the offers made by agent A in each episode of the game. It can be seen that by the end of the simulations the value offered by the agent is far from the minimum.

The plot below makes it easier to see that, despite agent A's learning process, the randomness in agent B's actions prevents agent A from optimizing its strategy.

As a result, the path of agent A's rewards does not grow during the simulations towards the total endowment.

Results for agent B.

The average action of agent B remains, as should be expected, close to 0.5. Since epsilon is set to 1, every action taken by agent B is random, and because both actions, 0 and 1, have the same probability of being picked, the average decision lies halfway between them. In other words, the agent does not learn to accept all offers made by agent A.

Because of this lack of learning by agent B, the average rewards are far from those in the outcome of the theoretical game.

Learning against rules

Fixed rules

For this part of the exercise, the move method of the player_2 class is modified so that agent B follows a rule to take its action instead of learning by itself. More precisely, agent B will accept all offers from agent A that are greater than a defined proportion p of the endowment. The goal of the exercise is to verify whether agent A can learn the policy set for agent B and change its strategy accordingly. A total of 50 simulations is run, and agent A keeps its parameters at their optimal values.
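
A minimal sketch of such a rule-following move method, assuming hypothetical names and an endowment of 10 units (the actual modification lives in the player_2 module):

```python
# Hypothetical stand-in for the modified player_2.move: accept any
# offer strictly greater than a fixed proportion p of the endowment.
ENDOWMENT = 10  # assumed value for illustration

class RuleResponder:
    def __init__(self, p=0.3):
        self.p = p  # minimum acceptable proportion of the endowment

    def move(self, offer):
        # Action 1 = accept, action 0 = reject.
        return 1 if offer > self.p * ENDOWMENT else 0

responder = RuleResponder(p=0.3)
```

With this rule, the proposer's best response is the smallest offer above p times the endowment, which is the convergence described below.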

Results for agent A.

The table and plots below show how the average action of agent A converges to the threshold set by agent B for an offer to be accepted. Agent A is thus successful in learning the rule and modifying its strategy accordingly.

Because of the rule set by agent B and the learning of agent A, the rewards for agent B are limited.

Dynamic Rules

In this exercise, agent B keeps following a rule; however, the minimum proportion of the endowment that agent B is willing to accept is now dynamic: it shifts according to the episode number within the simulation. 50 simulations are run, and agent A keeps its parameters at their optimal values.
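
One way such a dynamic rule could be sketched, with an illustrative two-step schedule that is an assumption rather than the notebook's actual one:

```python
# Hypothetical dynamic-rule responder: the minimum acceptable proportion
# of the endowment depends on the episode number. The schedule below is
# an illustrative assumption.
ENDOWMENT = 10

class DynamicRuleResponder:
    def __init__(self, schedule):
        # schedule: list of (first_episode, proportion), sorted by episode.
        self.schedule = schedule

    def threshold(self, episode):
        p = self.schedule[0][1]
        for first, prop in self.schedule:
            if episode >= first:
                p = prop  # the latest schedule entry reached applies
        return p

    def move(self, offer, episode):
        # Action 1 = accept, action 0 = reject.
        return 1 if offer > self.threshold(episode) * ENDOWMENT else 0

# Example: accept offers above 20% early on, above 50% after episode 5,000.
responder = DynamicRuleResponder([(0, 0.2), (5_000, 0.5)])
```

When the threshold jumps, a proposer still playing its old greedy offer suddenly earns 0, which is exactly the pattern of rewards described below.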

Results for agent A.

Following the offers of agent A, it is evident that the agent is capable of learning the shifts in agent B's acceptance policy.

The learning process makes the agent take the greedy action in 90% of the episodes. However, when agent B's policy shifts and agent A keeps taking the greedy action, that action results in a reward of 0. After several such mistakes, agent A's Q-table is successfully updated and the agent modifies its offer. The episodes with 0-unit rewards can be easily spotted in the plot below.